ggplot2 is the quickest and most reliable package in R for making graphs
Load ggplot2 and ggthemes in the following way
If you get an error that indicates that a package does not exist, you should install these packages:
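A minimal sketch of the loading step described above. The install line is commented out because it only needs to run once, when R reports that a package is missing.

```r
# Run once if R reports that a package does not exist:
# install.packages(c("ggplot2", "ggthemes"))

library(ggplot2)   # plotting functions
library(ggthemes)  # extra themes and color scales for ggplot2
```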
We can start off with an example, examining the relationship between urbanization and life expectancy
The research question is: Do countries with higher levels of urbanization also have higher life expectancy?
What does the relationship between urbanization and life expectancy look like? Positive? Negative? Nonlinear?
Does the relationship vary by continent?
The data is available at: Life expectancy and Urbanization Data
The dataframe once you load it into R looks like the following:
setwd("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/big_data/week4/data/")
#Step1: Loading the data
life_exp_urb <- read.csv(file = './life_exp_urb.csv')
#Step2: Examining the first five entries
head(life_exp_urb, n=5)
Entity life_expectancy urbanization type
1 Aruba 70.25972 48.05939 Everything Else
2 Afghanistan 45.38333 18.61175 Everything Else
3 Angola 45.08466 37.53970 Everything Else
4 Anguilla 69.44028 NA Everything Else
5 Albania 68.28611 40.44416 Everything Else
A good way to examine the dataframe is to print its first few rows with head():
Entity life_expectancy urbanization type
1 Aruba 70.25972 48.05939 Everything Else
2 Afghanistan 45.38333 18.61175 Everything Else
3 Angola 45.08466 37.53970 Everything Else
or
Rows: 237
Columns: 4
$ Entity <chr> "Aruba", "Afghanistan", "Angola", "Anguilla", "Albania…
$ life_expectancy <dbl> 70.25972, 45.38333, 45.08466, 69.44028, 68.28611, 77.0…
$ urbanization <dbl> 48.059393, 18.611754, 37.539705, NA, 40.444164, 87.043…
$ type <chr> "Everything Else", "Everything Else", "Everything Else…
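The two kinds of output above can be reproduced along these lines. This is only a sketch: `life_exp_toy` is a hypothetical stand-in built from the five rows printed earlier, and base R's `str()` is used for the column overview. The "Rows / Columns / $" layout shown above is what `glimpse()` from the dplyr package prints, so that call is left as a comment for readers who have dplyr installed.

```r
# Hypothetical stand-in built from the five rows printed above
life_exp_toy <- data.frame(
  Entity          = c("Aruba", "Afghanistan", "Angola", "Anguilla", "Albania"),
  life_expectancy = c(70.25972, 45.38333, 45.08466, 69.44028, 68.28611),
  urbanization    = c(48.05939, 18.61175, 37.53970, NA, 40.44416),
  type            = "Everything Else"
)

head(life_exp_toy, n = 3)  # first three rows
str(life_exp_toy)          # base R: column names, types, and first values
# dplyr::glimpse(life_exp_toy)  # similar overview; requires the dplyr package
```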
Among the variables, we have:
Entity - country name - character or string variable
life_expectancy - average life expectancy - double-precision floating-point variable
urbanization - percentage level of urbanization - double-precision floating-point variable
type - group of countries - EU, Latin America, Everything Else - character or string variable
The final goal is to obtain a graph like this:
The first step is to define a plot object and add layers to it
The second step is to add layers
The mapping argument of the ggplot() function defines how variables in your dataset are mapped to visual properties (aesthetics) of your plot.
The mapping specifies the x and y
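As a sketch of these first two steps, the code below defines a plot object and its mapping. The `toy` dataframe holds made-up values for illustration only; its column names follow `life_exp_urb`.

```r
library(ggplot2)

# Made-up values for illustration only; column names follow life_exp_urb
toy <- data.frame(
  urbanization    = c(75, 80, 85, 55, 65, 70, 20, 40, 60),
  life_expectancy = c(74, 77, 79, 66, 69, 71, 48, 57, 65),
  type = rep(c("EU", "Latin America", "Everything Else"), each = 3)
)

# data = the dataframe; mapping = which columns go on the x and y axes
p <- ggplot(data = toy,
            mapping = aes(x = urbanization, y = life_expectancy))
p  # shows empty axes only, because no geom has been added yet
```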
The third step is to define a geometrical object - geom to plot the data
There are different types of geometrical objects
geom_bar() - bar geoms
geom_line() - line geoms
geom_boxplot() - boxplot geoms
geom_point() - point geoms
In our case, we will use geom_point()
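Adding the geom could look like the following sketch, again on a made-up `toy` dataframe rather than the real data.

```r
library(ggplot2)

# Made-up values for illustration only
toy <- data.frame(
  urbanization    = c(75, 80, 85, 55, 65, 70, 20, 40, 60),
  life_expectancy = c(74, 77, 79, 66, 69, 71, 48, 57, 65)
)

# geom_point() draws one point per row, using the x and y mapping
p <- ggplot(toy, aes(x = urbanization, y = life_expectancy)) +
  geom_point()
p
```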
We now have something that looks like a scatterplot
There appears to be a positive relationship between urbanization and life expectancy: countries with higher urbanization tend to have higher life expectancy
The fourth step is to add aesthetics
Could the relationship between urbanization and life expectancy depend on the type of countries: e.g. EU, Latin America, the rest of the world?
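One way to explore this question is to map the group variable to the color aesthetic. The sketch below assumes a made-up `toy` dataframe with the same column names as `life_exp_urb`.

```r
library(ggplot2)

# Made-up values for illustration only
toy <- data.frame(
  urbanization    = c(75, 80, 85, 55, 65, 70, 20, 40, 60),
  life_expectancy = c(74, 77, 79, 66, 69, 71, 48, 57, 65),
  type = rep(c("EU", "Latin America", "Everything Else"), each = 3)
)

# color = type gives each group of countries its own color and a legend
p <- ggplot(toy, aes(x = urbanization, y = life_expectancy, color = type)) +
  geom_point()
p
```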
To add more visual clarity about the relationship between urbanization and life expectancy on a continent level, we can include a smooth curve
We can add a global fitting line by using the following code:
Note the difference between the two versions of the code:
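A sketch of one usual way to get the two behaviours; it is not necessarily the exact code from the lecture. The difference lies in where `color` is mapped. The data are made up, and `method = "lm"` (a straight line) is an assumption chosen because the toy groups are too small for the default smoother.

```r
library(ggplot2)

# Made-up values for illustration only
toy <- data.frame(
  urbanization    = c(75, 80, 85, 55, 65, 70, 20, 40, 60),
  life_expectancy = c(74, 77, 79, 66, 69, 71, 48, 57, 65),
  type = rep(c("EU", "Latin America", "Everything Else"), each = 3)
)

# Version A: color is mapped in ggplot(), so every layer inherits it
# and geom_smooth() fits one line per group
p_group <- ggplot(toy, aes(x = urbanization, y = life_expectancy, color = type)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Version B: color is mapped only inside geom_point(), so geom_smooth()
# treats all rows as one group and fits a single global line
p_global <- ggplot(toy, aes(x = urbanization, y = life_expectancy)) +
  geom_point(aes(color = type)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
```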
To make the difference between different groups more obvious, we can choose different shapes.
Thus, we can also map type to the shape aesthetic in addition to color.
We can improve our graph by adding labels by using the labs() function
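The shape mapping and the labels can be combined as in the sketch below. The data are made up and the label texts are hypothetical wording, not the lecture's own.

```r
library(ggplot2)

# Made-up values for illustration only
toy <- data.frame(
  urbanization    = c(75, 80, 85, 55, 65, 70, 20, 40, 60),
  life_expectancy = c(74, 77, 79, 66, 69, 71, 48, 57, 65),
  type = rep(c("EU", "Latin America", "Everything Else"), each = 3)
)

p <- ggplot(toy, aes(x = urbanization, y = life_expectancy,
                     color = type, shape = type)) +
  geom_point() +
  labs(title = "Urbanization and life expectancy",  # hypothetical label text
       x = "Urbanization (%)",
       y = "Life expectancy (years)",
       color = "Type", shape = "Type")  # same legend title so the legends merge
p
```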
ggplot2 can also be used to visualize categorical and numerical variables.
For example, we can think of type as a categorical variable:
In other words, whether this is the EU, Latin America, or everything else, this would be a categorical variable
Categorical Variables - can take on a limited number of possible values. Each observation is associated with a particular group based on some qualitative property.
This is how we can represent a barplot
We can order the bars in descending order using the following
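A sketch of both barplots on made-up data. Ordering is done here with base R by fixing the factor levels, which is one of several possible approaches; `type_sorted` is a hypothetical helper column.

```r
library(ggplot2)

# Made-up categories for illustration only
toy <- data.frame(
  type = c("EU", "EU", "Latin America",
           "Everything Else", "Everything Else", "Everything Else")
)

# geom_bar() counts the rows in each category
p_bar <- ggplot(toy, aes(x = type)) +
  geom_bar()

# Descending order: set the factor levels from the sorted frequency table
counts <- sort(table(toy$type), decreasing = TRUE)
toy$type_sorted <- factor(toy$type, levels = names(counts))

p_sorted <- ggplot(toy, aes(x = type_sorted)) +
  geom_bar()
```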
Numerical Variables can take on a wide range of numerical values
One common way to visualize numerical variables is with the help of distributions
A histogram divides the x-axis (horizontal) into equally spaced bins
It then uses the height of a bar to display the number of observations that fall in each bin.
This specific histogram shows that the majority of the countries in the sample have an average life expectancy between about 66 and 68 years.
We can set the width of the intervals in a histogram with the binwidth argument.
This is measured in the units of the x variable.
Let us look at the differences between bin sizes
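The effect of `binwidth` can be sketched with two histograms of the same made-up values. The widths 10 and 2 are arbitrary choices, expressed in years because that is the unit of the x variable.

```r
library(ggplot2)

# Made-up life expectancy values (years) for illustration only
ages <- data.frame(
  life_expectancy = c(45, 52, 58, 63, 66, 67, 67, 68, 70, 72, 75, 80)
)

# Wide bins: each bar covers 10 years
p_wide <- ggplot(ages, aes(x = life_expectancy)) +
  geom_histogram(binwidth = 10)

# Narrow bins: each bar covers 2 years, so there are more, shorter bars
p_narrow <- ggplot(ages, aes(x = life_expectancy)) +
  geom_histogram(binwidth = 2)
```

Both plots count the same observations; only the grouping into bins changes.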
An alternative to histograms is a density plot
This is a smoothed-out version of a histogram
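A minimal density plot, sketched on the same kind of made-up values, swaps the histogram geom for `geom_density()`.

```r
library(ggplot2)

# Made-up life expectancy values (years) for illustration only
ages <- data.frame(
  life_expectancy = c(45, 52, 58, 63, 66, 67, 67, 68, 70, 72, 75, 80)
)

# geom_density() draws a smooth estimate of the distribution
p_density <- ggplot(ages, aes(x = life_expectancy)) +
  geom_density()
p_density
```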
To visualize relationships, we need at least two variables mapped to aesthetics
A boxplot is a type of visual shorthand for measures of position (percentiles) that describe a distribution.
Let us now look at our own dataset
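A boxplot needs a categorical variable on one axis and a numerical variable on the other. The sketch below uses made-up values with the column names of `life_exp_urb`.

```r
library(ggplot2)

# Made-up values for illustration only
toy <- data.frame(
  life_expectancy = c(74, 77, 79, 66, 69, 71, 48, 57, 65),
  type = rep(c("EU", "Latin America", "Everything Else"), each = 3)
)

# One box per group: the middle line is the median,
# the box edges are the 25th and 75th percentiles
p_box <- ggplot(toy, aes(x = type, y = life_expectancy)) +
  geom_boxplot()
p_box
```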
We can also create a density plot
We can use the color, fill, and alpha aesthetics to add transparency to the filled density curves
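A sketch of grouped density curves on made-up data. The value `alpha = 0.5` is an arbitrary choice; values closer to 0 are more transparent.

```r
library(ggplot2)

# Made-up values for illustration only
toy <- data.frame(
  life_expectancy = c(74, 77, 79, 66, 69, 71, 48, 57, 65),
  type = rep(c("EU", "Latin America", "Everything Else"), each = 3)
)

# color sets the outline, fill the area under each curve;
# alpha makes the fills see-through so overlapping curves stay visible
p_dens <- ggplot(toy, aes(x = life_expectancy, color = type, fill = type)) +
  geom_density(alpha = 0.5)
p_dens
```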
Previously, we looked at one way to present the relationship between urbanization and life expectancy, continent by continent
Another (more effective) way is to plot facets
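Facets can be sketched with `facet_wrap()`, which draws one panel per value of the grouping variable. The data below are made up for illustration.

```r
library(ggplot2)

# Made-up values for illustration only
toy <- data.frame(
  urbanization    = c(75, 80, 85, 55, 65, 70, 20, 40, 60),
  life_expectancy = c(74, 77, 79, 66, 69, 71, 48, 57, 65),
  type = rep(c("EU", "Latin America", "Everything Else"), each = 3)
)

# ~ type: split the scatterplot into one panel per group of countries
p_facets <- ggplot(toy, aes(x = urbanization, y = life_expectancy)) +
  geom_point() +
  facet_wrap(~ type)
p_facets
```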
Here is how the two compare
Once you have created your plot, you can save it to a file with ggsave()
We should first provide a name for the object
And then save that object
This will be saved in your working directory.
You can also be more precise about the dimensions of your figure
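The saving steps might look like the sketch below. The object name `urb_plot`, the file name, and the 6 x 4 inch size are all hypothetical. The file is written to a temporary folder here so that the example leaves nothing behind.

```r
library(ggplot2)

# Made-up values for illustration only
toy <- data.frame(
  urbanization    = c(75, 80, 85, 55, 65, 70, 20, 40, 60),
  life_expectancy = c(74, 77, 79, 66, 69, 71, 48, 57, 65)
)

# Step 1: give the plot object a name
urb_plot <- ggplot(toy, aes(x = urbanization, y = life_expectancy)) +
  geom_point()

# Step 2: save that object, stating the dimensions explicitly
out_file <- file.path(tempdir(), "urbanization_life_exp.pdf")  # hypothetical name
ggsave(filename = out_file, plot = urb_plot,
       width = 6, height = 4, units = "in")
```

Passing a bare file name such as "urbanization_life_exp.pdf" instead of a full path writes the file to the working directory. ggsave() picks the output format from the file extension, so ".png" works the same way.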
You may also want to plot temporal data
To do that we need to load the life expectancy data over time
The data is available at: Life expectancy
Let us look again at the data
Let us imagine that we want to plot life expectancy over time
Let us look at the transformed data.
We should now create a simple timeplot of life expectancy over time
Up until 1950, there is a lot of volatility in life expectancy
To view the upward sloping trend in life expectancy, it might be worth aggregating some of the years.
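One possible way to aggregate years is to average by decade, sketched below. Grouping by decade is an assumption, since the lecture does not say which grouping it uses. The yearly series is made up: a rising trend with an artificial zig-zag standing in for volatility.

```r
library(ggplot2)

# Made-up yearly series: upward trend plus an alternating +/- 2 year wiggle
yearly <- data.frame(Year = 1900:1929)
yearly$life_exp <- 40 + 0.3 * (yearly$Year - 1900) + rep(c(-2, 2), times = 15)

# Assign each year to its decade (1900, 1910, 1920) and average within decades
yearly$decade <- floor(yearly$Year / 10) * 10
by_decade <- aggregate(life_exp ~ decade, data = yearly, FUN = mean)

# The decade averages smooth out the year-to-year swings
p_decade <- ggplot(by_decade, aes(x = decade, y = life_exp)) +
  geom_line()
p_decade
```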
Let us imagine that we want to compare Italy and the US in life expectancy
This means we go back to the original file and select the US and Italy
setwd("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/big_data/week4/data/")
#Step1: Loading the data
life_expectancy <- read.csv(file = './life-expectancy.csv')
#Step2: Subsetting the data
life_expectancy_itus<-subset(life_expectancy, Entity %in% c("Italy", "United States"))
#Step3: Renaming Variables
names(life_expectancy_itus)[names(life_expectancy_itus)=="Life.expectancy.at.birth..historical."] <- "life_exp"
We can now plot life expectancy for the two countries over time
ggplot(life_expectancy_itus,
       aes(x = Year, y = life_exp, color = Entity)) +
  geom_line()
Popescu (JCU): Lecture 5